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Abstract 

We consider the problem of learning a low-dimensional signal model from a collection of training samples. The 
mainstream approach would be to learn an overcomplete dictionary to provide good approximations of the training 
samples using sparse synthesis coefficients. This famous sparse model has a less well known counteipart, in analysis 
form, called the cosparse analysis model. In this new model, signals are characterised by their parsimony in a 
transformed domain using an overcomplete (linear) analysis operator. We propose to learn an analysis operator from 
a training coipus using a constrained optimisation framework based on LI optimisation. The reason for introducing 
a constraint in the optimisation framework is to exclude trivial solutions. Although there is no final answer here for 
which constraint is the most relevant constraint, we investigate some conventional constraints in the model adaptation 
field and use the uniformly normalised tight frame (UNTF) for this purpose. We then derive a practical learning 
algorithm, based on projected subgradients and Douglas-Rachford splitting technique, and demonstrate its ability to 
robustly recover a ground truth analysis operator, when provided with a clean training set, of sufficient size. We also 
find an analysis operator for images, using some noisy cosparse signals, which is indeed a more realistic experiment. 
As the derived optimisation problem is not a convex program, we often find a local minimum using such variational 
methods. For two different settings, we provide preliminary theoretical support for the well-posedness of the learning 
problem, which can be practically used to test the local identifiability conditions of learnt operators. 

Index Terms 

Sparse Representations, Low-dimensional Signal Model, Analysis Sparsity, Cosparsity, Dictionary Learning 

I. Introduction 

Many natural signals can be modelled using low-dimensional structures. This means, although the investigated 
signals are distributed in a high dimensional space, they can be represented using just a few parameters in an appro- 
priate model JT). This is made possible by the fact that few parameters are actually involved in generating/describing 
such signals. Examples include polyphonic music, speech signals, cartoon images (where some lines/edges describe 
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the images) and standard videos, where the scenes change slightly frame to frame. In modern signal processing, 
low-dimensional modelling is one of the most effective approaches to clarify the ambiguities in inverse problems, 
regularise the signals and overcome some conventional physical, temporal and computational barriers. This model 
has been heavily investigated in the last decade and many remarkable results achieved in almost all signal processing 
applications, see for example J2], J3|. 

A natural and crucial question immediately arises. Given a class of signals that we are interested in, how can 
we model its low-dimensional structures? As an abstract answer, one could imagine that if X C K™ is the signal 
class with low-dimensional structures, then there must be a map T : K" — > R a such that T(x) is sparse for all 
x £ X. Unfortunately, finding such T out of all possible maps would be too complex, and one would have to 
restrict admissible class of maps. A simple option is to consider the class of linear maps. Hence, we ask: is there 
$1 £ R ax " such that fix is sparse for x £ XI This is the type of problem we focus on in this paper. Before getting 
to the nitty-gritty, we review relevant results to give some contexts to our approach. 

A. Sparse Signal Models 

The most familiar type of low-dimensional structure is the sparse synthesis model J4), ]5). In this setting, if a 
signal x £ M" can approximately be generated by adding just a few elementary vectors {d^}, called atoms, from 
an overcomplete set, called a dictionary, as follows, 



where D £ M. nxp is the dictionary (matrix) and I is an index set with small cardinality, i.e. \X\ <C n, it is called 
(approximately) sparse in D. From the signal processing perspective, it is then important to identify the support 
X. However, this is not an easy task (6), Q: when the signals are at most fc-sparse — this means \I\ < k — in 
model ((TJ, they live in a union of subspaces (UoS) model (8), 60, where each subspace has a dimension of at 
most k, and the number of such subspaces is exponential. A popular and practical approach to find a sparse v is 
the £i-minimisation [5 | in which one looks for v that minimises ||v||i among the ones that satisfy x w Dv. The 
attractiveness of the ^-objective comes from the fact that it promotes sparsity and is convex — good in computational 
aspects — at the same time. 

An alternative form of £i-minimisation has been used in practice. In this form, one looks for x that minimises 
j|Jlx||i among the signals that matches available (linear) information. Here, ft £ R axn is some known linear 
operator. Note that a solution of this £i-minimisation necessarily leads to many zeros in Ox and hence is orthogonal 
to many rows of fi (see Figure Q]a). This observation is not exactly compatible with the synthesis interpretation 
above, and the corresponding low-dimensional model is called the cosparse analysis model A signal x in 
this model is called cosparse. l~i is referred to as the analysis operator in this paper. Although this model is also 
an instance of the UoS model and relies on the vector form of parsimony, it has many structural differences with 
the synthesis sparsity model lfl2l . ATI . Very recently the signal recovery using analysis sparsity model has been 
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investigated |fl3l , ifTTl . (14*1 . As can be deduced from the expression Ox, the cosparse analysis model is intimately 
related to our work. 

B. Model Adaptation 

As can be seen from the synthesis model description, the dictionary D plays an important role. When we deal 
with the natural signals, one can then naturally ask, how good we can choose such a dictionary? Generally, we 
can use some signal exemplars JT3], lfl6l for learning a dictionary or some domain knowledge for designing a 
suitable dictionary JT7), [1181 . The readers can refer to lTT9ll and ll20l as some surveys on the dictionary selection 
approaches. It has been shown that the learned dictionaries usually out perform the canonical dictionaries, e.g. DCT 
and Wavelet, in practice. Online dictionary learning methods 12TI have also been introduced to adapt the generative 
model to a large set of data or a time-varying system. 

In the exemplar based dictionary learning methods, a set of training signals X £ M. nxl is given. Similar to sparse 
approximations, the t\ norm is often preferred as the sparsity-promoting objective lfl6l . |22l . and the dictionary 
learning with this objective can be formulated as follows: 

min ||V||i + ^||X-DV|||., (2) 

v,Dei) 2 

where || ■ j|i = . and T> is an admissible set. Here, D £ M. nxp is the dictionary we want to learn and 

V € M. pxl is the coefficient matrix which is expected to be sparse column-wise. The reason for T> is to exclude 
trivial solutions, such as X = DV, with V —> and D — > oo. This type of problem is called the scale ambiguity of 
the signal modelling. Various constraints for the resolution of scale ambiguity have been proposed for dictionaries: 
fixed atom norms l23l . lfl"5l . fixed Frobenius norm 11241 and convex balls made using these norms 11251 . 11261 . lETl . 
Note that the optimisation (ffl is no longer convex and can be difficult to solve. Different techniques have been 
proposed to find a sensible solution for (|2), where they are often based on alternating optimisations of the objective 
based upon V and D. See Ifl9l and |20l for a review on state of the art methods. 

Designing (overcomplete) linear transforms to map some classes of signals to another space with a few significant 
elements, is not a new subject. It has actually been investigated for almost a decade and many harmonic/wavelet 
type transforms have been designed with some guarantees on fast decay of coefficients in the transform domain 
(See for example 11271 and 11281 ). These transforms are designed such that the perfect reconstructions of the signals 
are possible, i.e. bijective mapping. 

Relatively few research results have been reported in the exemplar based adaptation of analysis operators. This 
is an important part of data modelling, which has the potential to improve the performance of many current signal 
processing applications. 

C. Contributions 

This paper investigates the problem of learning an analysis operator from a set of exemplars, for cosparse signal 
modelling. Our approach to analysis operator learning can be viewed as the counterpart to the optimisation (O in 
the analysis model setting. Here is the summary of our contributions in this work: 
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1) We propose to formulate analysis operator learning (AOL) as a constrained optimisation problem. Perhaps 
surprisingly, in contrast to its synthesis equivalent — the dictionary learning problem — the natural formulation 
of the analysis operator learning problem as an optimisation problem has trivial solutions even when scaling 
issues are dealt with. The constraints presented in this report exclude such trivial solutions but are also 
compatible with a number of conventional analysis operators, e.g. curvelets (23 and wave-atoms 1281 . giving 
further support for our proposal. 

2) We provide a practical optimisation approach for AOL problem, based upon projected subgradient method. 
Clearly, an AOL principle which does not permit computationally feasible algorithm is of limited value. By 
implementing a practical algorithm, we also open the possibility to cope with large scale problems. 

3) We give preliminary theoretical results toward a characterisation of the local optimality conditions in the 
proposed constrained optimisation problem. This helps us to better understand the nature of the admissible 
operator set. It also has potential use in the initialisation of our AOL problem. 

Our approach is developed from the idea that was primarily suggested in 1291 and ll30l . 

D. Related Work 

Remarkably, even though the concept of cosparse analysis modelling has only been very recently pointed out as 
distinct from the more standard synthesis sparse model ifPTl . PP . it is already gaining momentum and a few other 
approaches have already been proposed to learn analysis operators. The most important challenge in formulation 
of the AOL as an optimisation problem is to avoid various trivial solutions. Ophir et al. used a random cycling 
approach to statistically avoiding the optimisation problem becoming trapped in a trivial solution ||3"21 . Peyre et 
al. used a geometric constraint, using a linearisation approach, in the operator update step, to get a sensible local 
optimum for the problem 11331 . In a recent approach, Elad et al. have introduced a K-SVD type approach to update 
each row of the operator at a time 11341 . 11351 . While these approaches have shown promising empirical results, a 
specificity of our approach is its explicit expression as an optimisation problemQ We expect that this will open the 
door to a better understanding of the AOL problem, through mathematical characterisations of the optima of the 
considered cost function. 

E. Notation and Terminology 

In this paper we generally use bold letters for vectors and bold capital letters for matrices. We have presented 
a list of the most frequent parameters in Table [] We have also presented the corresponding parameters and their 
range-spaces, used in the related papers, in the same table. The notation ( • or simply subscript jj, has been 
used to specify the element locating in the i th row and j th column of the operand matrix. || ■ ||i, |j • ||2 and || • || F are 
respectively l\, £2 and Frobenius norms. With an abuse of notations, we will use || • ||i for the norm defined by the 

'Peyre and Fadili 1331 use a similar type of expression to learn an analysis operator in the context of signal denoising, with exemplars of 
noisy and clean signals. 
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TABLE I 

The notation of current and their corresponding notations in related papers. 



paper PF(33l 

(OEBP(32l) 



Observation 


ye: 


R™ 




y G R d 


Signal 


x g : 


R n 


x g R N 


x G M d 


Analysis vector 


z e i 


R a 






Dictionary 


Del 


nXp 


D g K JVxP 


D G K dxn 


Synthesis vector 


v g 


w 


u e R p 


z G K" 


Analysis operator 


del 


a X n 


D* GM PxJV 


f2 G M pxd 


Training size 


l 




?i 


H(JV) 


Cosparsity 


g 






I 


Sparsity 


k 






k 



entrywise sum of absolute values of the operand matrix, which is different from the l\ operator norm for matrices. 
The notation ( , ) represents the canonical inner-product of two operands for, respectively, £2 and Frobenius norms, 
in the vector-valued and matrix-valued vector spaces. tr{-} denotes the trace of the operand matrix. The notation 
V will be used to represent the orthogonal projection whose range is specified by the subscript, e.g. V VN . Sub- and 
super- scripts in braces are used to indicate iteration number in the algorithm section. 

The sparsity of x in D is noted here by k, which is the cardinality of the support of v, where v is the sparse 
representation of x. We can similarly define the cosparsity of x 11311 . with respect to 1~2, and note it by q, which is 
the cardinality of its co-support, i.e. co-support(x) = {i : 1 < i < a, {fix},; = 0}. 

Following the notation of [36], a sub-indexed vector or matrix will be determined using a subscript for the 
original parameter, e.g. fi A . Here, O a has the dimension of O, identical values as O in its support A and zeros 
elsewhere. We also use "bar", e.g. A, to denote a complement of the index set A in this context. 

F. Organisation of the Paper 

In the following section, we formulate the analysis operator learning problem as a constrained optimisation 
problem. We briefly review some of canonical constraints for the exemplar based learning frameworks and explain 
why they are not enough, we introduce a new constraint to make the problem 'well-posed' in III-B4I After the 
formulation of the AOL optimisation problem, we introduce a practical algorithm to find a sub-optimal solution 
for the problem in section [III] Following the algorithm, in Section [V] we will present some simulation results for 
analysis operator learning with different settings. The simulations are essentially based on two scenarios, synthetic 
operators and operators for image patches. In the appendix, we will investigate the local optimality conditions of 
operators for the proposed learning framework. 
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II. Analysis Operator Learning Formulation 

The aim of analysis operator learning is to find an operator f2 adapted to a set of observations of the signals 
y, := jM(xj), where x, is an element (column) of sample data X G M. nxl and .M(xi) denotes the information 
available for the learning algorithm. For our purposes, we set := x; + n^, where n; € R m denotes the 

(potential) noise. The data X is assumed to be of full rank n since otherwise, we can reduce the dimension of the 
problem with a suitable orthogonal transform and solve the operator learning in the new low dimension space. 

A standard approach for these types of model adaptation problems, is to define a relevant optimisation problem 
such that its optimal solution promotes maximal sparsity of Z := OX. The penalty || • || p := ^ | • \ p , < p < 1 can 
be used as the sparsity promoting cost function. As hinted in the introduction, we will use the convex i\ penalty. 
The extension to an £ p , p < 1, AOL is left for a future work. 

A. Constrained Analysis Operator Learning ( CAOL) 

Momentarily assume that we know X. Unconstrained minimisation of ||f2Xj|i, based on ft, has some trivial 
solutions. A solution for such a minimisation problem is fl = 0! A suggestion to exclude such trivial solutions 
is to restrict the solution set to an admissible set C. Not assuming that X is known any more, AOL can thus be 
formulated as, 

min|inX||i, s.t. fteC, \\Y-X\\ F <o (3) 
n,x 

where a is the parameter corresponding to the noise. We call the AOL to be a noiseless operator learning when 

(7 = 0. 

We prefer to use an alternative, regularised version of (0, using a Lagrangian multiplier A, to simplify its 
optimisation. The reformulated AOL is the following problem, 

min||f2X||i + -||Y-X|||, s.t. f2 G C (4) 
If A — > oo, problem (|4]i is similar to the noiseless case. In the following section, we explore some candidates for 

C. 

B. Constraints for the Analysis Operators 

A suitable admissible set should exclude undesired solutions of AOL while minimally affecting the rest of the 
search space. We name $1 = and RankjO} = 1 as some undesired solutions of (01. These operators clearly do 
not reveal low-dimensional structures of the signals of interest. For simplicity, we again assume that X is given, 
i.e a = 0. 

Here, we intially propose some constraints for the problem ([3j and explain why some of them can not individually 
exclude undesired solutions. A combined constraint C, which is the Uniform Normalised Tight Frame (UNTF), will 
be introduced subsequently. The proposed constraint is smooth and differentiable. 



7 




Fig. 1. Data clouds around some union of subspaces and possible analysis operators: a) an ideal operator, b) the optimal (rank-one) operator 
using a row norm constraint, c) the optimal operator using the full-rank, row norm constraint and d) the optimal operator using tight frame 
constraint. 

1) Row norm constraints are insufficient: The first constraint is on the norms of rows of SI, i.e. = c for 
the i th row. We can find the solution of the optimisation by first finding lu* £ K™ that minimises ||w T X||i. Then, 
the optimum is obtained by repeating oj*, i.e., Sl'l := [a;, = w]?^ a i is the minimum. Of course, Rank{JlJ} = 1 
(see Figure [T|b), and the solution is not interesting enough. 

2) Row norm + full rank constraints are insufficient: The set Cp of full rank operators is not closed, and the 
operator SI* belongs to its closure. As a result, the problem (|3} with a = does not admit a minimiser but there 
is a sequence of operators, arbitrarily close to St*, that yield an objective function arbitrarily close to its infimum 
value. 

3) Tight frame constraints are insufficient: In a complete setting a = n, an orthonormality constraint can resolve 
the ill-conditioning of the problem. The rows of SI are geometrically as separated as possible. Letting a > n, the 
orthonormality constraint is not further applicable, i.e. more rows than the dimension of space. In this case, the 
orthonormality constraint in the ambient space, Vi ^ j, oj 1 J-UJ : ' and ||a/||2 = l^^'ib = 1, where uj l andw- 7 E M. a are 
respectively the i th and j th columns of SI, is possible. The admissible set of this constraint is the set of tight frames 
in R axn , i.e. Sl T Sl = I, where I is the identity operator in R". The admissible set C = {SI e R axn : Sl T Sl = 1} 
is smooth and differentiable, which is a useful property for optimisation over this set. Such a constraint actually 
constructs a manifold in the space of IR ax ™, called the (orthogonal) Stiefel manifold St(a,n) l37l . 

Although the tight frame constraint may look appropriate to avoid "trivial" solutions to ([3j, preliminary empirical 
and theoretical investigations indicate that the analysis operator minimising (f3), using this constraint, is always an 
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orthonormal basis completed by zero columns (see Figure [T]d). This is possibly caused by the fact that the zero 
elements in such operators, are orthogonal to all finite signals, without giving any insight about their directions, 
and the operators can thus be truncated to produce complete operators. Therefore, this constraint does not bring 
anything new compared to the orthonormality constraint. 

4) Proposed constraint: Uniform Normalised Tight Frame: This motivates us to apply an extra constraint. Here, 
we choose to combine the uniformly normalised rows and the tight frame constraints, yielding the UNTF constraint 
set. This constraint will be used in the rest of paper and the analysis of Appendix [A] is based on this admissible 
set. 

The UNTF's are actually in the intersection of two manifolds, uniform normalised (UN) frames manifold and tight 
frames (TF) manifold. There exist some results that guarantee the existence of such tight frames, i.e. non-emptiness 
of the intersection of two manifolds, for any arbitrary dimensions n and a ll38l . This is indeed an important fact, 
as the optimisation with such a constraint is then always feasible. 

This constraint can be written as follows, 

c = {sie R QXn : si T si = i, v» |H| 2 = c}, (5) 

and u>i is the ith row of SI. Since C is not convex, may have many separate local optima. Finding a global 
optima of is not easy and we often find a local optima using variational methods. Using variational analysis 
techniques, we can also check if an operator SI is a local optimum. Details are presented in Appendix [A] 

The exact recovery, using a variational analysis, in this setting, is essentially based on the following property, 
Definition 1 (Locally Identifiable): An operator SI is locally identifiable from some (possibly noisy) training 
cosparse signals Y, if it is a local minimum of an optimisation problem with a continuous objective and locally 
connected constraint. 

This definition of local identifiability is a generalisation of the (global) identifiability as here we can not find the 
global solution. Such a property has been previously investigated in the context of dictionary learning in ll36l and 

ED. 

C. Extension of UNTF Constraints 

In some situations, we may not necessarily want to require that SI form a frame for the whole of R™. The 
inspiration for such a case comes from the finite difference operators which is closely tied to the popular TV-norm. 
When we model cartoon-like piece-wise constant images, an analysis operator SI need only to detect transitions in 
neighboring pixel values. Therefore, all the rows of SI can be orthogonal to the DC component. 

We can facilitate learning similar rank-deficient operators by extending the UNTF constraints. Here, we know 
that SI _L Af, where Af is the known null-space. We can modify (0 to construct the following constraint, 

C M = {SI G R axn : Sl T Sl = TV, ||^|| 2 = c}, (6) 
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where V N ± is the orthogonal projection operator onto the span of Af^. 

In the following, we introduce a practical algorithm to find a suboptimal solution of (0]), constrained to C or Cjv, 
using a projected subgradient based algorithm. 

III. Constrained Analysis Operator Learning Algorithm 

Cosparse signals, by definition, have a large number of zero elements in the analysis space. In this framework, 
a convex formulation for recovering a signal x from its (noisy) observation is to minimise the following convex 
program, 

min||nx|| 1 + ^||y-x||2, (7) 
x 2 

where fi is a fixed analysis operator. As © is a convex problem, many algorithms have been proposed to solve 
this program very efficiently (see PTfl . HTI and Il42l as some examples of the latest methods). These methods can 
easily be extended to the matrix form to find X given Y. Unfortunately, the AOL problem ©, due to the freedom 
to also change fi 6 C in this formulation, is very difficult. 

This difficulty is similar to what we face in the dictionary learning for synthesis sparse approximation. In the 
dictionary learning problem, we have also a joint optimisation problem (O that, in a simple setting, can be minimised 
in an alternating optimisation approach B31 as follows, 
Dictionary Learning: 
initialisation: D^, V' ' = 0, i = 0, 
while not converged do 

V [i+i] = ar g m i nv nvlli + |||X - DHV|||, 
D[*' +1 1 = argmin DeX) ||X - DV[ i+1 l|||, 
i = i + l 
end while. 

An alternating minimisation technique can also be used for the AOL problem to seek a local minimum. In this 
framework we first keep X fixed and update the operator ft. We then update the cosparse approximations of the 
signals while keeping the operator fixed. Such an alternating update continues while the new pair (X, f2) is very 
similar to the previous (X, fi) or is repeated for a predetermined number of iterations. A pseudocode for such an 
alternating minimisation AOL is presented in Algorithm Q] Analysis Operator Learning Algorithm (AOLA). Each 
subproblem, i.e line [3] or line |4] of AOLA can be solved separately based upon a single parameter. Although the 
problem in line [4] is convex and it can thus be solved in a polynomial time, the problem in line [3] is not a convex 
problem due to the UNTF constraint C. 

Note that when the cosparse matrix X is given, i.e. in a noiseless scenario a = 0, the algorithm only has the 
operator update step, which we need to repeat until convergence. This step is also called noiseless AOL ||29l . 
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Algorithm 1 Analysis Operator Learning Algorithm (AOLA) 
l: initialisation: flM, X °l = Y, i = 0, 
2: while not converged do 

3: rj[' i+1 ] = argmin 0ec H^X^Hi, 

4: Xl i+1 ] = argmin x ||f^ +1 lX|| i + |||Y - X||| 

5: i = 1 + 1 

6: end while. 



A. Analysis Operator Update 

We use a projected subgradient type algorithm to solve the operator update subproblem in line [3] of the AOLA. 
The subgradient of the objective is df(Sl) = sgn(f2XM)XW T , where sgn is an extended set-valued sign function 
defined as follows, 



{sgn(A)} 4< =sgn(A ij ) 

1 a > 0, 

sgn( a ) = \ [-1,1] a = 0, 
-1 a<0. 

In the projected subgradient methods, we have to choose a value in the set of subgradients. We randomly choose 
a value in [—1, 1], when the corresponding element is zero. 

After the subgradient descent step, the modified analysis operator is no longer UNTF and needs to be projected 
onto the UNTF set. Unfortunately, to the authors' knowledge, there is no analytical method to find the projection 
of a point onto this set. Many attempts have been done to find such an (approximate) projection, see for example 
P4l for an alternating projections and P31 for an ordinary differential equations based method. We rely on an 
alternating projections approach. 

Projection of an operator with non-zero rows, onto the space of fixed row norm frames is easy and can be done 
by scaling each row to have norm c. We use Vtjn to denote this projection. If a row is zero, we set the row to 
a normalised random vector. This means that Tun is not uniquely defined. This is due to the fact that the set of 
uniformly normalised vectors is not convex. The projection can be found by, 

v UN {n] = [v UN {ui}] i: 

v otherwise, 
where v is a random vector on the unit sphere. 

Projection of a full rank matrix onto the tight frame manifold is possible by calculating a singular value 
decomposition of the linear operator iPBI . Let ft £ R ax " be the given point and O = US V T be a singular 
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value decomposition of ft and I axn be a diagonal matrix with identity on the main diagonal. The projection of ft 
can be found using, 

Ttf{£1} = UI ax „V T . 

Note that, although there is no guarantee to converge to a UNTF using this method, this technique practically 
works very well P4l . As the projected subgradient continuously changes the current point, which needs to be 
projected onto UNTF's, we only use a single pair of projections at each iteration of the algorithm, to reduce 
the complexity of the algorithm, where the algorithm now asymptotically converges to a UNTF fixed point. To 
guarantee a uniform reduction of the objective f(ft), we can use a simple line search technique to adaptively reduce 
the stepsize. It indeed prevents any large update, which increases . A pseudocode of this algorithm is presented 
in Algorithm [2] where K max is the maximum number of iterations. 

For the constraint set Cm mentioned in Section III-C1 we present a simple modification to the previous algorithm 
in order to constrain the operator to have a specific null-space. Here, we need to find a method to (approximately) 
project an operator onto C^/. Following the alternating projection technique, which was used earlier to project onto 
C, we only need to find the projection onto the set {ft £ M. axn : ft T ft = P u ±}, as we already know how to 
project onto UN, i.e. V UN . To project an ft onto the set of TF's in Af- 1 , we need to compute the singular value 
decomposition of ft, projected into the orthogonal complement space of Af. The projection onto Af , 'P jV ±, can 
be found using an arbitrary orthonormal basis for Af and simply subtracting the projection onto this basis, from 
the original ft. We can then rewrite the decomposition as T' M ±{ft] = UE V r . If the dimension of Af is r, only 
n — r singular values of V M ± {ft} can be non-zero. We find the constrained projection as follows, 

V T F±N{to) = UIr QX „V T , 

where matrix Ir ax „ is a diagonal matrix in R ax ™, with identity values on the first n — r diagonal positions, while 
the last r elements are set to zero. 

We therefore need to alternatingly use Vtf±M and Vtjn to find a point in the intersection of these two constraint 
sets. In Algorithm 12 we only need to replace Vtf with VtfjlN to consider the new constraint Ctf. 

B. Cosparse Signal Update 

When ft is known or given by line [3] of AOLA, a convex program based on X needs to be solved. Although this 
program is a matrix valued optimisation problem, it can easily be decoupled to some vector valued subproblems 
based upon the columns of X. One approach is to individually solve each subproblem. Here we present an efficient 
approach to solve this program in the original matrix-valued form, despite the challenging nature of it, due to its 
large size. 

The main difficulties are, a) £i is not differentiable and b) ft inside the i-y penalty, does not allow us to use 
conventional methods for solving similar l\ penalised problems. We here use the Douglas-Rachford Splitting (DRS) 

2 The stability of algorithm is guaranteed in a Lyapunov sense, i.e. the operator fl[ fc j is enforced to remain bounded, when k — > oo. 
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Algorithm 2 Projected Subgradient Based Analysis Operator Update 
Input: XW, if m ax, nM, 77, e < 1, p < 1, 

initialisation: k = 1, O [0] = 0, = 

while e < ||n [h] - n^-ijH*. and k < K max do 

n ea/(n w )=Bgn(n w xW)xw T 

= "Pc/jv {"Ptf {^ [fe] - 77 O g }} 
while f(to [k+1] ) > /(n w ) do 

n [fc+1] = t^c/at {tvf {r2 [fc] - 77 n G }} 

end while 

k = k + 1 
end while 

output: ni i+1 i = n [fc _u 



technique to efficiently solve the cosparse signal approximation of the AOLA. It is also called the alternating direction 
method of multipliers (ADMM) in this setting, see ll46l for a brief overview. This technique has indeed been used 
for the Total Variation (TV) and analysis sparse approximations in l40l l47lP . We here only present a simple version 



of the DRS technique, carefully tailored for this problem. The problem is a constrained convex program with two 
parameters Z = fi^+^X and X as follows, 

min||Zj|i + ^||Y -X||| s.t. Z = [l+1] X. (8) 

The Augmented Lagrangian (AL) ll49l method is applied to solve dS). In the Lagrangian multiplier method we 
use the dual parameter B € R axi and add a penalty term, (B,fiI l+1 lX — Z). In the AL, we also add an extra 
quadratic penalty related to the constraint and derive the new objective g(K, Z,B) as follows, 

s(x, z, b) HizHi + ^|| y - xf F + 7 (b, n^x - Z 

+ l\\n^X-Z\\% 



K V V ,|2 , 7, 

2 



I Z|| 1 + ^||Y - X||| + ^||B + n^x - Z\\ 2 F 



where < 7 £ R + is a constant parameter. According to the duality property, the solution of max B min x z <?(X, Z, B) 
coincides with the solution of ([8). Using the DRS method, we iteratively optimise a convex/concave surrogate 
objective <v s (X, Z, B, B^j) = <?(X, Z, B) — ||B — B^ \\ F , where B[fc] is the current estimation of B. The fixed points 



3 1471 derives the formulation by incorporating the Bregman distance. However, it has been shown that the new method, called Alternating 
Bregman Splitting method, is the same as DRS, applied to the dual problems 1481 . 
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Algorithm 3 DRS Based Cosparse Signal Update 
Input: i, K max , X^, 7, A, e < 1, A, 7 

l: initialisation: k = 1, X [fe] = X^, f2 = B H = 0, 

2: while e < ||X [fc] - X [fc _ 1] || F and k < K max do 

3: X [k+1] = (AI + 7 ^)- 1 (AY + 7^ T (Z [fe] -B [fe] )) 

4: = Si {r2X[ fe+1 j + B[ fc ]} 

5: = Bl fc ] + (fiX[ fe+1 ] - Z[ fe+1 j) 

6: fc = k + 1 

7: end while 

8: output: X[' i+1 l = X [fe _ 1] 



of the iterative updates of <? S (X, Z, B, Brw ) are the same as p(X, Z, B), as the extra term ||B — B[fe] ||f. vanishes in 
any fixed points. <7 S (X, Z,B,Bny) is convex with respect to Z and X and concave with respect to B. We can thus 
iteratively update each of the parameters, while keeping the rest fixed, see Algorithmic] In this algorithm, S a , with an 
a > , is the entry wise soft-threshold operator defined by S a ((3) = (5 — a sgn(/3) if > a and otherwise l50l . 
Note that the update formula for X in Algorithm [3] line [3] involves a matrix inversion, which is computationally 
expensive. As $7 is a tight frame here, the matrix inversion is significantly simplified using the fact that fi T fi is 
identity. In this case, the operator (AI + 7 J1 T 0) _1 is simply a scaling with 

We iterate Algorithm [3] for a number of iterations K max or until the parameters cease to change significantly. 
Although the convergence of the iterative updates of this algorithm can separately be investigated, it can also be 
deduced using the fact that it is a particular case of DRS, which converges under mild conditions B6l . 

IV. Simulations 

We present two categories of simulation results in this section. The first part is concerned with empirical 
demonstration of (almost) exact recovery of the reference analysis operators. An astute reader might ask: why 
should we care about exact recovery at all? Is it not enough to show that learned operators are good? This is, of 
course, true. However, when the signals are generated to be cosparse with respect to a reference operator or they 
are known to be cosparse with respect to a known analysis operator, the recovery of those operators is a good 
demonstration of the effectiveness of our method. Going further, one may imagine situations where the rows of 
the reference analysis operators explain certain properties/dynamics of the signals; for example, finite differences 
for piecewise constant images detect edges. Obviously, the recovery of reference operators in these contexts is 
significant. 

In the second part, we are interested in demonstrating the effectiveness of learned operators in a practical task, 
namely, image denoising. 
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Average Percentages of Operator Recovery 




Cosparsity 



Fig. 2. The average percentage of operator recovery for different 7's, where 7 controls how far is the starting point fij„ form fin. The x-axis 
presents the cosparsity of the synthetic data. 



A. Exact recovery of analysis operators 

For the first experiment, the reference operators and the cosparse signal data set were generated as follows: A 
random operator fl n - G M 24x16 was generated using Lid. zero mean, unit variance normal random variables^. The 
reference analysis operator f2 is made by alternatingly projecting Q - onto the sets of UN's and TF's, A set of 
training samples was generated, with different cosparsities, by randomly selecting a normal vector in the orthogonal 
complement space of a randomly selected q rows of fi . Such a vector has (at least) q zero components in £l yi, 
and it is, thus, q cosparse. 

To initialise the proposed algorithm, we used a linear model to generate the initial f2 by combining the reference 
operator fl a and a normalised random matrix N, i.e. $l in = f2 + 7N, and then alternatingly projecting onto UN 
and TF. It is clear that when 7 is zero, we actually initialise f2 with the reference model fi and when 7 — » 00, 
the initial ft in will be random. 

First, we chose a set of size I = 768 of such training corpus and used the noiseless formulation (0, where a = 0. 
The AOL algorithm in this setting is just the projected subgradient base algorithm, Algorithm [2] which was iterated 
50000 times. To check the local optimality of the operator and the size of basin of attraction, we chose 7 = 0, 1, 10 
and 00. The average percentage of operator (rows) recovery, i.e. the maximum £2 distance of any recovered row 
and the closest row of the reference operator, is not more than V-001, for different cosparsity and 100 trials, are 
plotted in Figure [2] We practically observe that the operator is the local optimum even when the cosparsity of the 
signal is as low as 3. We also see that the average recovery reduces by starting from a point far from the the actual 
reference operator or a random operator, which is the case 7 — > 00. 

We now investigate the role of I on the average operator recovery by some simulations. We kept the previous 
experiment settings and repeated the simulations for two new training sets, with populations of I = 384 and 1536, 

4 fl - is not necessarily a UNTF and needs to be projected onto the set of UNTF's. 
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y=0 7=1 7=10 




246 246 246 




Fig. 3. The average percentage of operator recovery with different training set population size I. The x'-axis presents the cosparsity of the 
signals. 



(a) (b) 




cosparsity cosparsity 

Fig. 4. The local-identifiability check by randomly generating vectors in the admissible set of Lemma [5] of Appendix [A] and checking the 
credibility of the inequality in (a) Lemma [5} Equation H7Y , and (b) Theorem [T} Equation 1181 . 



which are 1/2 and 2 times of the population in the previous experiments. We show the average operator recovery 
for 7 = 0, 1, 10 in Figure [3] The simulation results show not only that Sl can be locally identified with even 
less cosparse signals, smaller q, but the basin of attraction is also extended and now the reference operator can be 
recovered by starting from a distant initial point, even using 2 cosparse signals. 

In the next two experiments, we show that if the cosparsity is low, i.e. small q, then the analysis operator cannot 
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Fig. 5. A512by512 Shepp-Logan phantom image which was used as the training image. 
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2-times Overcomplete 
Haar Operator 



Fig. 6. The rows of the full-rank learned operator (left panel), the rank-deficient learned operator (middle panel) and a two-times oevrcomplete 
Haar operator (right panel), in the form of 8 by 8 blocks. 



be "recovered": it is not a solution of the proposed optimization problem (01. One way to demonstrate it would 
consist in solving the large scale convex optimisation program $15[ expressed in Proposition |2] of the Appendix 
lAl Alternatively, we can use Lemma [3] (respectively Theorem [U of the appendix, to possibly find a matrix A z , 
which violates the conditions. Here, we randomly generated 1000 A z in 7c,, and checked whether the inequality 
( fTTI i (respectively (fT8l i) was satisfied. We repeated this process for 10 different pseudo-randomly generated f2 and 
plotted the percentages of A z satisfying the inequality in Figure |4] respectively in the left column (a) and the right 
column (b). We have also mentioned the training size, i.e. I on each row of this figure. From left column, we can 
see that with 1-cosparse signals, there are operators which are not local minimum of the program (|3) (cr = 0). We 
can also see that the relaxation used to derive Theorem Q] makes the result of this Theorem very conservative, i.e. 
there are many cases which do not satisfy this theorem, but they still can be locally-identifiable. 

For the last experiment, we chose the Shepp Logan Phantom, a well-known piecewise constant image in the 
Magnetic Resonance Imaging (MRI) community (see Figure[5]). We used I = 16384 blocks of size 8 by 8, randomly 
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Fig. 7. The value of objective in I14t . in comparison with a base line, i.e. the objective value with a two times overcomplete Haar wavelet 



operator. 



selected from the image, such that, except one DC training sample, all the training images contain some edges. We 
learned a 128 x 64 operator for the vectorised training image patches, by initialising the (noise-less) AOL algorithm 
with a pseudo-random UNTF operator. We used two different constraints C and for the learning operator f2, 
i.e. © and (O, where the null space was selected to be constant images. We chose the second constraint to see if 
we would find a different operator by enforcing a similar null -space to the Finite Difference (FD) operator. 

The learned operators are shown in Figure |6] We found operators which have many FD type elements. The 
operator which we learned using C in © seems to have more FD type elements. We also compare the level of 
objective value for the learned operator, using C in ©, and a reference operator. As a two dimensional FD operator 
is not a UNTF, we chose a two-times overcomplete Haar wavelet — see the right panel of Figure [6] — which has 
some similarities with the FD, in the fine scale. The objective of (O for the learned operator in different steps of 
training and the Haar based operator are shown in Figure [7] This shows that the learned operator finally outperforms 
the baseline operator in the objective value. Note that if we initialise the algorithm with the Haar based operator, 
the AOL algorithm does not provide a new operator. This empirically shows that the reference operator is a local 
optimum for the problem (2), where A^oo. This fact can be investigated using the analysis provided in Appendix 
lAl i.e. randomly generating admissible vectors and checking (flTt or ( fl~8l l. 

B. Image denoising with learned analysis operators 

For the next experiments we used a set of face images which are centred and cropped [51]. Such images can be 
modelled as approximately piecewise smooth signals. A pseudo-random admissible SV- ' € fl^i28x64 jj as b een usec j 
as an initial analysis operator and a training set of size I = 16384 of 8 x 8 image patches from 13 different faces, 
after normalised to have mean zero, have been used to learn the operators (a = 128). 

We applied both of the noiseless AOL Algorithm and the noise-aware AOL Algorithm to demonstrate how much 
the cosparsities of the training samples increase using the noise aware formulation. (NL)AOL algorithm was iterated 
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Fig. 8. Signals in the analysis space M. N , N = 128. The coefficients with almost zero magnitude, i.e. less than 0.01, are indicated with stars. 
The cosparsity in each case is: (a) p = 0, (b) p = 1, and (c) p = 27. 
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Fig. 9. The cosparsities of y (bottom plot) and y (top plot) respectively with the operators learned with (NL)AOL and NAAOL. 



-Kmax = 100000 times and NAAOL algorithm iterated for Knmx = 10 iterations with A = 7 = 0.5, while the 
inner-loop, i.e. Algorithm [2] was iterated 100000 times. 

A plot of the analysis coefficients for a selected y along with its corresponding cosparsities, with three different 
fl's, are presented in Figure The initial operator fio = J~^ ' has been applied to y in (a). Not surprisingly, the 
signal is not cosparse with this arbitrary operator (q = 0). In (6), the same plot is drawn using the learned operator 
with (NL)AOL. Although some coefficients are small, most are not zero, and q = 1. In the last plot, we have shown 
the analysis coefficients for x using the learned operator with NAAOL. It is clear that the cosparsity has been 
increased significantly (from q = 1 to q = 27). We have further plotted the cosparsities of the first 256 training 
samples y's using the learned operator found by (NL)AOL and corresponding approximations x's, which are found 
by NAAOL, in Figure [9] This figure also shows, the operator learning using the noise aware formulation (2), where 
A is finite, results in much greater cosparsity. We show the learned operator found using NAAOL algorithm in 
Figure [10] This experiment suggests that a harmonic type operator should perform better than FD based operators 



19 



for such images. This is an interesting observation which should be explored in the future in more detail. 

The aim of next experiment is to compare denoising performance of an image using a learned operator and a 
FD operator ll52l which is similar to the analysis operator in the TV denoising formulation and is shown in the 
right panel of Figure ITOb . We keep the previous experiment settings. The learned operator and the FD operator 
can now be used to denoise a corrupted version of another face from the database, using ||7). The original face is 
shown in Figure QT| (a) and the noisy version, with additive i.i.d Gaussian noise, is shown in (b), with PSNR = 
26.8720 dB. Denoising was performed using two different regularisation settings: (A = 7) = 0.3, 0.1. The bottom 
two rows show the denoised images using the FD operator and the learned operator. We can visually conclude 
that the two operators successfully denoise the corrupted images with some slight differences. The results with the 
learned operators are smoother (this is mostly visible on a screen rather than a printed copy of the paper). The 
average PSNR's of the denoised images over 100 trials are presented in Table HI] Although we get a marginally 
better average PSNR using the learned operator instead of FD when A = 0.3, we get a noticeably better average 
PSNR, i.e., 1 dB, when A = 0.1. 

As the initial goal was to increase the cosparsities of signals, we have also shown the cosparsities of different 
patches of the selected face image (see Figure [12}. The horizontal axis presents the index number of the patches. To 
compare the cosparsity using these operators, we have plotted their differences and its average in the bottom plot. 
Positive values here demonstrate the cases when the learned operator is a better operator than the FD operator. The 
average, which is indeed positive, is plotted as a horizontal line. As a result, the learned operator here performs 
much better (15% improvement) than the FD operator. 

The synthesis framework has been used for different applications with some very promising results. In the last 
experiment, we have a comparative study between the two frameworks, for denoising face images. We chose the 
settings of the previous experiment and used the patch based learning. For the reference, we used a two times 
overcomplete DCT, which is a standard selection for the sparsity based denoising lf53l . For the synthesis sparse 
representation, we used OMP method, and for the analysis sparse recovery, we used a convex formulation similar 
to the previous experiment. The optimum value of A in the l\ analysis was 0.04. The OMP was running to achieve 
an approximation error equal to 1.15a, where a is the standard deviation of the additive noise. This setting has 
been used in ll53l . for denoising with the learned synthesis dictionary using K-SVD method. We used another 
technique, which has also been used in the mentioned reference, to average out over the overlapping-blocks of 
images to reduce the blocking effect. The average PSNR's of the denoised face images over 5 trials, for the DCT 
dictionary/operator are shown in the first row of Table |lll] 

This experiment demonstrates that the synthesis framework works better in the denoising of face images. We 
already have techniques to learn dictionaries/operators for each of these three cases. For the OMP and analysis 
based denoisings, we respectively chose K-SVD f54l and the proposed CAOL methods. The Lagrange multiplier 
for the analysis based denoising was A = 0.09, while we used the operator learned in the previous experiment. The 
average PSNR's of the denoised face images over 5 trials are presented in the second row of Table [ill] 

By comparing the synthesis denoising techniques, with the overcomplete DCT and K-SVD dictionaries, we find a 
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Fig. 10. The learned analysis operator in a noisy setting, A = 0.1 (left panel) and the finite difference operator (right panel). 

TABLE II 

Average PSNR (dB) of the denoisedface images (100 trials) using, a) FD and b) the learned operators, when the 

AVERAGE PSNR OF THE GIVEN NOISY IMAGES WAS 26.8720 DB 





A = 0.3 


A = 0.1 


Finite Difference 


28.9643 


31.5775 


Learned Operators 


29.0254 


32.5202 



slight improvement (+0.2 dB) in the PSNR of the denoised images, using the learned dictionary. This improvement 
is more significant in the analysis denoising framework, as we achieved more than 2.5 dB improvement in the 
PSNR of the denoised image. In comparison between two frameworks, although the PSNR of the denoised image 
using the learned operator is significantly improved, it is marginally behind the synthesis competitor. However, we 
found these results very promising, since this is only the beginning of the field "denoising with the learned analysis 
operators". More experiments are necessary to find a reason for such a behaviour, which we leave it for the future 
work. 

V. Summary and Conclusion 

In this paper, we presented a new concept for learning linear analysis operators, called constrained analysis 
operator learning. The need for a constraint in the learning process comes from the fact that we have various trivial 
solutions, which we want to avoid by selecting an appropriate constraint. A suitable constraint for this problem 
was introduced after briefly explaining why some of the canonical constraints in model learning problems are not 
sufficient for this task. Although there is no claim that the introduced constraint is the most suitable selection, we 
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(a) (b) 




Fig. 11, Face image denoising. Top row: original face (left), noisy face (right). Denoising results using jTJ. Middle row: A = 0.3 using the 
learned analysis operator (left) and the finite difference operator (right). Bottom row: same as middle row with A = 0.1. 



150 

iooh 

50 




Cosparsity with the learned operator 



150 
100 - 
50 
L 



50 100 150 200 250 300 350 400 450 500 
Cosparsity with the Total Variation operator 

50 100 150 200 250 300 350 400 450 500 
Differences between cosparsities 




50 100 150 200 250 300 350 400 450 500 
Image patch number 



Fig. 12. Cosparsities of image patches (a)-(b) and a comparison (c). 



practically observed very good results with the introduced algorithm. 

In the simulation part, we showed some results to support our statements about the relevance of the constraint 
and the learning algorithm. We actually demonstrated that the learning framework can recover a synthetic analysis 
operator, when the signals are cosparse and enough training signals are given. By applying the algorithm on two 
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TABLE III 

Average PSNR (dB) of the denoised face images (5 trials), when the average PSNR of the given noisy images was 

APPROXIMATELY 26.6 DB (VARIES SLIGHTLY IN DIFFERENT EXPERIMENTS) 





Synthesis (OMP) 


Analysis (£i) 


DCT 


34.63 


31.6 


Learned 


34.87 


34.22 



different types of image classes, i.e. piecewise constant and approximately piecewise smooth images, we learned 
operators which have respectively similarities with the finite difference and harmonic type operators. This observation 
emphasises the fact that we benefit from selecting the most appropriate analysis operator for image processing tasks. 
The Matlab code of proposed simulations will be available, as a part of the dictionary learning toolbox SMALLbox 
Il55l , later. 

The proposed constrained optimisation problem is a non-convex program, for which we can often find a local 
optimum using variational optimisation methods. In the appendix, we characterise the local optima of such programs. 
This can be useful to avoid initialising the algorithm with a local minima, and also for empirically checking when 
an operator can be a local minima. The latter emphasises the fact that we can recover such an operator, by starting 
the algorithm with a point in a close neighbourhood of the operator, which we do not know a priori. 

Appendix A 
Theoretical Analysis of the CAOL 

If we rewrite the optimisation program (2), using the UNTF admissible set, we find the following program, 

min||fiX||i + -||Y-X||2 s.t. fl T fl = I 

V, |K|| 2 = c. 

The variational analysis f56l of the objective, near to a given pair (fio,Xo), provides the following proposition. 
Proposition 1: A given pair (S7q; Xo) is a local minimum for (O if and only if A = and E = are the only 
solutions of the following convex program, 

£a(A,£) 



min||n X + AX +n E||i + Atr{E r (X - Y)} 

4Tn „ TA (10) 
s.t. A 1 flo + fl^ A = 

Vi (w0i,Si) = 0. 

Proposition Q] is actually checking the local optimality of a pair, using the first-order approximation of the 
objective, which can be achieved using X <— Xo + E and f2 <— f2o + A. The constraint of ([TUt is the tangent space 
of the constraint of (|9]l at f2o, 

Tc ■= {A : A t J1q + £IqA = & V^ (wo 4 ,&) =0}, 
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which is a linear subspace of R ax ™. 

This proposition will be used in the next two subsections to investigate the conditions for the local optimality of 
a point in two different scenarios: (NA)AOL and NLAOL. 

A. Noise Aware Analysis Operator Learning 

The tightest condition for the local optimality of pair (flo, X ), is the following necessary and sufficient condition. 
Lemma 1: A pair (Ho,X ) is a local optimum of (0 if and only if the following inequality holds, 

||(AX + fto£) A ||i > I (AX + n E,sgii(noXo)) 

(11) 

+ Atr{E T (X -Y)}|, 

for all non-zero (A, E) £f c x R" x '. 

Proof: According to the proposition Q] we can check the optimality of (Oo,Xo) by checking that (A,E) = 
(0, 0) is the only solution of JlOt . The proof here is formed by showing that the objective of ( TlOb increases if 
(O ,X ) is replaced with (fJ + A, X + E), or equivalently, lim^o ±(£ A (*A,*E)-£ A (0,0)) > 0, VE,VA € tM, 
when dTTb is assured. From the definitions of £ A (A,E) in ( TTOb and ||A||i = (A, sgnA), we have, 

£ A (tA,tE)-£ A (0,0) 

=||f2oXo + i(AXo + n E)||i - ||n Xo||i + iAtr{E T E} 
= (0 Xo + t(AX + n E), sgn(n X + t(AX + fi E))) 
- (n Xo,sgn(n Xo)) + iAtr{E T E} 

tj,o 



(12) 



;t(AX + rj E,sgn(f2 Xo)) 

+t(AX + n E,sgn((AXo + n E) A )) +tAtr{E T E} 

=t(AX + noE,sgn(n Xo)) 

+t\\ (AX + n S)A||i + tX tr{E T E} 
where E := Xo — Y. The positivity of ( TT~2T > is assured when 

IKAXo + OoE^ll! > | (AX + n E,sgii(noXo)) 
+ Atr{E T E}|. 

This completes the sufficiency of (fTTT i for the optimality. If (fTTT > is violated, 3(A, E) € 7~c X R mxL such that 
| (AX + r2 E,sgn(r2 X )) + Atr{E T E}| is greater than ||(AX + J1 E)aIIi- As the constraint Tc, is a linear 
space, when A and E are admissible, ±A and ±E are also admissible. If we set s = sgn((AXo + rirjE, sgn(floXo)}- 
Atr{E T E}) e {-1,1} and (A WC ,E WC ) = (sA,s£) eT c x W ixl , it is easy to show that 

+ O E wc ,sgn(rj X )) + Atr{Ej, c E} 

+ (A WC X + n E wc ,sgn((A tuc X + O E tuc ) A )) < 0, 



5 This condition for the (local) optimality is related to the positivity condition of directional derivative of C\ for any admissible A and S in 
the tangent cone(s) of the constraint(s), see for example 1561 . 
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which contradicts the optimality of (n ,Xo), i.e. 3(A,E) e Tc x M™ xi , lim ti0 i(£A(tA,i£) - £a(0,0)) > 0. 
This completes the necessary part of the lemma. ■ 
Remark 1: As lemma Q] is valid for any non-zero admissible pair (A, E), we can choose (A, 0) and (0, S) and 
get the following necessary conditions for optimality, 

||(AX ) A ||i > | (AX , S gn(n X )) |, VA e T c , 

||(n E) A ||i > |(0 E,sgn(n Xo))+Atr{E T E}|. 

Lemma 2: A sufficient condition for the local optimality of a pair (tlo, Xq) for (0 is, 

||(AXo + noS) A ||i > ||(AXo + n E)A||i 

(13) 

+ A|tr{E T (X -Y)}|, 

for all (A, E) e T c X R nxl . 

Proof: The proof is based on finding an upper bound for the right hand side of (fTTT i. Note that the maximum 
of (AXo + J~2oE, sgn(OoXo)) may be achieved when sgn(AXo + J~ioE) is equal to sgn(fioXo) on A. Thus, 

(AX + rJoE,sgii(0 Xo)) < ||(AXo + noS) A ||i. 

This means that condition ( fT3l implies the condition of Lemma Q] and (f2o,Xo) is thus a local minimum for (|9). 

■ 

Remark 2: Note that lemma [2] only presents a sufficient condition, as the sign pattern match may not happen, 
i.e. sgn(AX + O E) ^ sgn(ft X ), VA 6 T c , E. 

Remark 3: Although the recovery condition here has some similarities with the dictionary recovery conditions 
ll36l . (39), the analysis learning formulation is slightly more involved as it is not possible to check the local 
optimality based on each parameter, ft or X, separately. This is caused by the fact that the non-differentiable term, 
i.e l\, is a function of both parameters. Here, it makes the problem non-separable and the joint optimality condition 
is needed to be checked. 

B. Noiseless Analysis Operator Learning 

When the noise and model mismatch does not exist, we can solve (0 with A->oo. This case may be considered 
as an ideal operator learning form, as Y is here exactly cosparse. This formulation can be used as a benchmark for 
the operator recovery, very similar to the framework in [f36l , l39l for dictionary recovery. In this setting we have 
X = Y € K raxi , we can simplify the problem as follows, 

min||nX||x s.t. T f2 = I 

(14) 

V. t \\u}i\\ 2 = c. 

As this formulation has a single parameter fi, the analysis is significantly easier. ([141 has n x a, i.e. Q a xn, 
unknown parameters and n 2 + a constraints. For a full rank X, the system of equations is underdetermined if 
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na > n 2 + a. As we assume X is rich enough and therefore full rank, minimising the objective ||OX||i is 
reasonable if a > n 2 /(n— 1). 

We use formulation ( TT4T > to recover operator Oo, while X is cosparse with Oo- The following proposition 
investigate the local optimality conditions of a point Oo for dl4t . which is actually equivalent to Proposition Q] 
when A — > oo. This proposition has a flavor similar to those obtained in the context of dictionary learning for the 
synthesis model [36), as well as in hybrid synthesis/analysis framework ll57l . 

Proposition 2: An operator O is locally identifiable using (TPfl i, if and only if A = is the only solution of the 
following program, 

min || (O + A)X||i s.t. A T O + O^A = 

(15) 

Vi (u> 0i ,6i) = 0. 

Note that ( fT~5T > has a convex objective with a linear constraint. This matrix valued convex problem has a close relation 
to the analysis parsimony problems investigated in lfT2l . |fl~3), ifTTI . fl4l . where X T acts as the analysis operator 
of the columns of (Oo + A) T . We present some conditions for locally identifiability of a Oo using variational 
analysis. A similar technique for the analysis parsimony problems have already been used in IfTTI . Ifl4l . 

The kernel of the matrix X has an important role in the identifiability of the analysis operator. Let X nx ; = 
UnxnSnxjVL; be a singular value decomposition of X. We can now partition V into two parts, i.e. one part 
is multiplied to Si = diag(<7i) € R" x ™, which is the non-zero diagonal part of E, and Vo, a basis for the kernel 
of X, and find X = UEiVY and X f = Vi£j~ 1 U T . Let two new parameters be defined as Zq := OoX , A z := 
AX , g M. axl . We can now reformulate (fl31 l based on the new parameters as follows, 

minHZo + A.ll! s.t. Vf (A^Z + Z^A Z ) V 1 = 

A 2 (V!l]j; 2 V[)ZJoI = (16) 
A Z V = 

where o is the Hadamard product. The constraint of (fTol is linear, which it means, we can represent it based on 
a linear operator \& : IR a ' — > M n + a + a ( ( -™) and the vector version of A z , 8 V Z = vect(A 2 ) as M/<5j = 0. Such an 
operator is derived in Appendix 151 \1/ has a nontrivial kernel when a > n /{n — 1). We can now characterise the 
kernel subspace of the constraint of (fT6l . Tc B , as follows, 

T Cs ={Ae M ax/ : =0,6" = vect(A)} 

Lemma 3: An operator Oo is identifiable using the cosparse training samples X, where Vi : supp(Ox^) = Aj, 
and (O, if and only if for all A z e Tc s , 

|( S g n (O X),A z )|< ||(A,) A ||i (17) 

where A is the cosupport of X, i.e. A = : (OoX),*j = 0}. 

Proof: As O is a convex problem, A = is the only solution iff j| (Oo+tA)Xo||i — |OoXq || i > 0, VA 6 Tc s 
and 1 10. This is equivalent to the statement based on ( TToT ) as follows, A = A Z X^ = is the only solution of (fT~5b 
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iff 1 1 Zo + iA z j|i — 1 1 Z o 1 1 1 > 0, VA Z e Tc s and 1 1 0. We can now find the necessary and sufficient condition for 
the subject as follow, 

||Z +tA,||i- || Zo || i = (Z +tA z ,sgn(Z +tA z ))- ||Z ||i 

f i° (Zo+tA 2 ,sgn(Z )) 

+ t(A,,sgn(A 2 )>-||Z ||i 

= t <A„ sgn(Z )> + 1|| (A^) A || i 
As t > 0, positivity of ||Zo+£A z ||i— ||Zo||i is equivalent to the positivity of (A z ,sgn(Zo)) + ||(A z )^||i. l~c a 
is a linear space and therefore if A 2 G 7c s > — A 2 g 7c, too. This freedom in choosing the sign of A z , lets 
us have (A' z , sgn(Z )) = -| (A z ,sgn(Z )) | for any given A 2 , such that | (A 2 , sgn(Z )) | = | (A' z , sgn(Z )) |. 
Therefore, the necessary and sufficient condition for A = to be the only solution of ( fT3T > is VA 2 E Tc s , 
|(A z ,sg n (n X))| < ||(A,) A ||i ■ 
Theorem 1: Any analysis operator Co : R" — > M. a is identifiable by (fT~4T >. where X = [Xi] ie[M] , such that 
Vi : supp(Jlxi) = Aj, a — |Aj| > q, is a training g-cosparse matrix, if VA 2 € 7c B , 

||(A,) A || 1 < IKA^^IU (18) 

where A = A c . 

Proof: This theorem is a generalisation of lemma [3] The proof is based on, firstly, finding a lower-bound for 
the left-hand side of dTD i and, secondly, generalising the lemma using any possible cosparsity pattern A. 
(sgn(fioX), A 2 ) is never greater than (sgn(A z ), A 2 ), as, 

(sgn(n X),A z > = K A ^)i,il s S I1 ( A ^i,iSgii(n X) jJ 

= E (A^ij sgn(A 2 )ij 
(j,i)eA 

= (sgn(A 2 ) A ,A 2 ) 

This assures that (sgn(fioX), A 2 ) < ||(A z )a||i- For any it assures that if ||(A 2 )a||i < ||(A z ) A ||i, J^o is 
identifiable based on lemma [3] Now, if we consider all fio providing g-cosparse training samples, with some 
possible cosparsity A, the inequality ( TT8l assures the identifiability of fig- ■ 
Remark 4: Note that, not all index sets of cardinality q can produce cosparsity patterns A, given f2o iTTTl . 
Similarly, not all sign patterns are feasible as sgn(fioX), when X is a variable matrix. As a result, the inequality 
condition of Theorem Q] i.e. ( fT8l ), is just a sufficient condition for the local identifiability of any Hq's, providing 
some g-cosparse training signals. While ( TPTt is the necessary and sufficient condition for the local identifiability, 
we can empirically show that ( fT8l is not tight, see Figure |4] 
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Appendix B 

Deriving a simple linear representation of the constraints 

The optimisation problem ([Tol l has three constraints, where the last two are clearly linear, and they can be 
represented as, 

^(A z ) Ifc (V ) fci = 0, Vie [1,*], je[l,l-m] 

k (19) 

^A Zlfc Q fcl =0, Vie[l,n], 
fc 

where Q := ViEj^Vi . Let A := Z Vi. The first constraint can now be reformulated as, VJA 2 A + 
A T A 2 Vi = 0. To derive this equation in a similar form to ([19), we reformulate A 2 Vi, then right-multiply 
(A 2 Vi) T with A. 

(V[A-), M = (A,) ir (Vi) r< 

l<r<l 

((vrADA) 4 ,= ^ ( ^ (A z ) fcr (Vx) ri ) A fej 
= E E (Vi)„A fcJ (A z ) fcr 

l<fc<Tl l<T<i 

We now reformulate the first constraint as, 

£ ]T ((Vi) r< A fcJ + (Y 1 ) rj A ki )(A z ) kr = 

!<(=<„ l<r<l (20) 

Vi, j G [1, m] 

We can now generate 1 S'( a+a ( I _„) + „2 )xoi using the weights in ( fT9l ) and d20l ), corresponding to A z , and make a linear 
presentation as 3f<^ = 0, ^ = vect(A 2 ). 
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